Goal: exploring and extracting insight from complex datasets
Important: don't post lengthy screenshots of your code publicly — use a private post and we will respond
Exploratory Data Science: Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python.
Introduction to Geospatial Data Science: Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.
Data Ingestion & Big Data: Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.
Geospatial Data Science in the Wild: Armed with the necessary data science tools, students will be introduced to a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.
From Exploration to Storytelling: The final module will teach students to present their analysis results using web-based formats to transform their insights into interactive stories.
Homeworks will be assigned (roughly) every two weeks. You must complete five of the seven homework assignments. Four of the assignments are required, and you are allowed to choose the last assignment to complete (out of the remaining three options).
The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.
Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience.
very versatile: good for both exploratory data analysis and polished finished products
Note: as a free service, it can be a bit slow sometimes
To follow along, go to https://github.com/MUSA-620-Fall-2019/week-1
These slides are a Jupyter notebook.
A mix of code cells and text cells in Markdown. You can change the type of a cell in the top menu bar.
# A simple code cell
# SHIFT-ENTER to execute
x = 10
print(x)
# integer
a = 10
# float
b = 10.5
# string
c = "this is a test string"
# lists
d = list(range(10))
# booleans
e = True
# dictionaries
f = {'key1': 1, "key2": 2}
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
f = dict(key1=1, key2=2, key3=3)
# access the value with key 'key1'
f['key1']
# access the second list entry (0 is the first index)
d[1]
# the first character
c[0]
# Python code
result = 0
for i in range(100):
    result = result + i
print(result)
a = range(10)  # this is a lazy range object (an iterable), not a list
print(a)
# convert it to a list explicitly
a = list(range(10))
print(a)
# or use a list comprehension; the result is the SAME
a = [i for i in range(10)]
print(a)
def function_name(arg1, arg2, arg3):
    ...
    # code lines (indented)
    ...
    return result
def compute_square(x):
    return x * x
sq = compute_square(5)
print(sq)
def compute_product(x, y=5):
    return x * y
# use the default value for y
print(compute_product(5))
# specify a y value other than the default
print(compute_product(5, 10))
# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(x=5, y=2))
# argument names must match the function signature though!
# this raises a TypeError, since the function has no argument named "z"
print(compute_product(5, z=5))
Use tab auto-completion and the ? and ?? operators
this_variable_has_a_long_name = 5
# try hitting tab after typing this_
this_variable_has_a_long_name
# try typing "r" and then tab
# Forget how to create a range? --> use the help message
range?
Use the ?? operator
compute_product??
The question mark operator gives you access to the help message for any variable or function. It is very useful!
This was a very brief introduction. Additional Python tutorials are listed on this week's README and on the course materials page.
The Python Data Science Handbook is a free online textbook covering the Python basics needed in this course. In particular, the first four chapters are excellent:
Note that you can click on the "Open in Colab" button for each chapter and run the examples interactively using Google Colab.
The following line imports the pandas package:
import pandas as pd
The primary objects in pandas are the:
- DataFrame, which is very similar to an Excel spreadsheet, with rows and named columns
- Series, which represents a single column of data

A DataFrame contains one or more Series and a name for each Series. The data frame is a commonly used abstraction for data manipulation, and similar implementations exist in R.
You can think of Series objects as fancier versions of Python's built-in list data type.
To create a Series object:
# city names and population
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
city_names
DataFrame objects can be created by passing a dict mapping string column names to their respective Series.
df = pd.DataFrame({ 'City Name': city_names, 'Population': population })
df
# access columns with a dict-like syntax
df['Population']
# list slicing: get the elements with indices 1 and 2 (but NOT 3)
cities = ['San Francisco', 'San Jose', 'Sacramento']
cities[1:3]
# slicing data frame rows is very similar!
df[1:3]
pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.
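For example, a few of the most common selection patterns, sketched on a small frame like the one defined above:

```python
import pandas as pd

# re-create the small example frame for completeness
df = pd.DataFrame({
    "City Name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
})

col = df["Population"]               # one column -> a Series
row = df.loc[0]                      # one row by its index label -> a Series
sub = df[["City Name"]]              # a subset of columns -> a DataFrame
big = df[df["Population"] > 900000]  # boolean selection -> matching rows only

print(big)
```

Here only San Jose passes the boolean test, so `big` has a single row.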
df['Population'].median()
NumPy is a popular toolkit for scientific computing.
pandas Series can be used as arguments to most NumPy functions:
import numpy as np
# calculate the median population value
np.median(df['Population'])
For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.
For example, we can find which cities have a population greater than a million:
# define our function
def get_large_cities(population):
    return population > 1e6
large_cities = df['Population'].apply(get_large_cities)
large_cities
# add the new computed column to our original data frame
df['Large Cities'] = large_cities
# we can also use lambda (unnamed, inline) functions
df['Small Cities'] = df['Population'].apply(lambda population: population < 1e6)
# print out
df
We can select the "large" cities by passing the boolean values to the DataFrame's .loc[] indexer:
df['Large Cities']
df.loc[df['Large Cities']]
df['Population'] > 1e6
# this is equivalent to doing
df.loc[df['Population'] > 1e6]
The documentation is your best friend! How to get help?
# use the question mark
pd.DataFrame.loc?
Let's load census data on rental rates from the 2017 American Community Survey.
We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.
Use the read_csv() function from pandas. The first argument to read_csv() is the file path to load.
df = pd.read_csv("./data/census/ACS_17_1YR_B25003.csv")
df.head()
Select certain columns and rename them:
# columns we want
columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
df2 = df[columns].copy()
# rename columns
df2.columns = ['City', 'Total', 'Rentals']
df2.head()
df2['Rental Rate'] = df2['Rentals'] / df2['Total'] * 100
df2.head()
len(df2)
We can select the largest cities by population using the Series.isin() function, which checks if each value in the Series is in the specified list.
top_cities = ['Philadelphia city, Pennsylvania',
'New York city, New York',
'Los Angeles city, California',
'Houston city, Texas',
'Chicago city, Illinois',
'Phoenix city, Arizona']
# select the rows where the city is in our list of top cities
df2.loc[df2['City'].isin(top_cities)]
Python uses the "%" operator to insert variable values into strings. For example, the file name of the data we want to load for 2017 is:
year_str = "17"
path = "./data/census/ACS_%s_1YR_B25003.csv" % year_str
print(path)
See this guide on this type of string formatting for more info.
Python version 3.6 introduced a new and improved string formatting syntax, known as f-strings, where the variable being inserted can be referenced directly by name.
The variable name should be enclosed in curly braces inside the string:
year_str = "17"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path)
For more info: see this guide
data = []
for year in range(2005, 2018):
    # read data for this year
    year_str = str(year)  # convert integer value of "year" to a string
    year_str = year_str[2:]  # extract the last two digits of the year string
    df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")

    # columns we want
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    df = df[columns]

    # rename columns
    df.columns = ['City', 'Total', 'Rentals']

    # calculate the rental rate
    df['Rental Rate'] = df['Rentals'] / df['Total'] * 100

    # select the cities we want
    selection = df['City'].isin(top_cities)
    df = df.loc[selection]

    # add a column for the year
    df['Year'] = year

    # save
    data.append(df)
# combine all of the data frames into one along the row axis
data = pd.concat(data, axis=0)
data.head()
matplotlib
Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting library, matplotlib.
Import the library we'll use from matplotlib.
from matplotlib import pyplot as plt
We'll use the plt.plot() function which plots a simple x vs y line.
with plt.style.context('fivethirtyeight'):
    for city in data['City'].unique():
        print(city)
        this_city = data['City'] == city
        df = data.loc[this_city]
        plt.plot(df['Year'], df['Rental Rate'], label=city)

    # format
    plt.legend(loc=0, ncol=3)
    ax = plt.gca()
    ax.set_ylim(30, 72)
    ax.set_ylabel("Rental Rate in Percent")
    ax.figure.set_size_inches((10, 6))
Exercise: can you calculate what the increase was for Philadelphia?
# trim to just Philadelphia first, using this boolean selection index
data['City']=='Philadelphia city, Pennsylvania'
# select Philadelphia only
philly = data.loc[data['City']=='Philadelphia city, Pennsylvania']
philly
Note how the first row of the above dataframe has an index label of 393. This means we can access it using the .loc[] function like:
philly.loc[393]
This is different than the .iloc[] function, which also indexes based on the integer value of the row, regardless of what the labels in the index are. For example, the first row of a dataframe can always be accessed using:
philly.iloc[0]
We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. For example:
philly.reset_index(drop=True)
Next, to calculate our percent difference we need to get the values for Philadelphia in 2005 and 2017:
# select 2005
x = philly.loc[philly['Year']==2005]
value_2005 = x['Rental Rate'].iloc[0]
print("2005 value = ", value_2005)
x['Rental Rate']
# select 2017
y = philly.loc[philly['Year']==2017]
value_2017 = y['Rental Rate'].iloc[0]
print("2017 value = ", value_2017)
# calculate percent change
percent_change = (value_2017 - value_2005)/ value_2005 * 100
print("change in rental rate in Philadelphia from 2005 to 2017 = ", percent_change)
For more details on the .iloc[] function, see the documentation on indexing by position.
Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.
Files were downloaded earlier from https://www.zillow.com/research/data/
home_values = pd.read_csv("./data/zillow/Metro_Zhvi_AllHomes.csv", engine='python')
rent_values = pd.read_csv('./data/zillow/Metro_Zri_SingleFamilyResidenceRental.csv', engine='python')
Peek at the first few rows of the ZRI data:
rent_values.head()
And do the same for the ZHVI data:
home_values.head()
valid_cities = ['New York, NY', 'Chicago, IL',
'Los Angeles-Long Beach-Anaheim, CA',
'Philadelphia, PA', 'Houston, TX', 'Phoenix, AZ']
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
rent_values_trimmed
Unwanted columns can be dropped from the data frame using the drop() function.
unwanted = ['SizeRank', 'RegionID']
home_values_final = home_values_trimmed.drop(unwanted, axis=1)
unwanted = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(unwanted, axis=1)
rent_values_final
Currently, our data is in wide format: each observation has its own column. This usually results in many columns but few rows.
home_values_final
Usually it's better to have data in tidy (also known as long) format.
Tidy datasets are arranged such that each variable is a column and each observation is a row.
In our case, we want to have a column called ZRI and one called ZHVI and a row for each month that the indices were measured.
pandas provides the melt() function for converting from wide formats to tidy formats.
pd.melt?
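As a toy illustration (the column names and numbers below are made up for this sketch), melting turns the per-month columns into rows:

```python
import pandas as pd

# hypothetical wide-format data: one column per month
wide = pd.DataFrame({
    "RegionName": ["Philadelphia, PA", "Phoenix, AZ"],
    "2017-01": [1389, 1311],
    "2017-02": [1392, 1319],
})

# melt: the month columns become rows of (RegionName, Date, ZRI)
tidy = pd.melt(wide, id_vars=["RegionName"],
               var_name="Date", value_name="ZRI")
print(tidy)
```

Each (city, month) pair is now its own row, which is the tidy layout we want.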
Now, let's melt our datasets:
ZHVI = pd.melt(home_values_final, id_vars=['RegionName'],
value_name='ZHVI', var_name='Date')
ZRI = pd.melt(rent_values_final, id_vars=['RegionName'],
value_name='ZRI', var_name='Date')
and take a look:
ZRI.head()
ZHVI.head()
Note that missing data values are represented as NaN ("not a number").
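A few common ways to detect and handle these missing values, sketched with made-up numbers:

```python
import pandas as pd
import numpy as np

# a small Series with one missing value
s = pd.Series([1500.0, np.nan, 1720.0])

print(s.isnull())   # boolean mask marking the missing entries
print(s.dropna())   # drop the missing entries
print(s.fillna(0))  # or fill them with a default value
```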
Another common operation is merging, also known as joining, two datasets.
We can use the merge() function to merge observations that have the same Date and RegionName values.
data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
data
Merging is very powerful and the merge can be done in a number of ways. See the infographic on joining in this repository.
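For example, the `how` keyword controls which keys survive the merge. A minimal sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# inner: keep only keys present in BOTH frames (just "b")
inner = pd.merge(left, right, on="key", how="inner")

# outer: keep ALL keys, filling missing values with NaN
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```

The outer merge above is the same pattern used for the ZRI/ZHVI data: it keeps every (Date, RegionName) pair that appears in either dataset.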
Currently our Date column is stored as a string.
pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.
# convert the Date column to datetime objects
data['Date'] = pd.to_datetime(data['Date'])
Quick trick: Series that hold datetime objects have a dt attribute that lets you grab parts of the date easily.
For example, we can easily add new columns for the month and year using:
# Note that the dtype is now datetime64[ns]
data['Date'].head()
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year
data.head()
Year
pandas is especially useful for grouping and aggregating data via the groupby() function.
From the pandas documentation, a "group by" operation involves one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

The documentation is available here.
We can calculate annual averages for each year by grouping by the RegionName and Year columns and taking the mean of our desired column. For example:
# calculate mean values for each Year and City (RegionName)
annual_ZHVI = data.groupby(['RegionName', 'Year'])['ZHVI'].mean()
annual_ZRI = data.groupby(['RegionName', 'Year'])['ZRI'].mean()
print(type(annual_ZHVI))
annual_ZHVI.head()
Note here that the result is indexed by the columns we grouped by (RegionName and Year).
We can reset the index so that the index values are listed as columns in the data frame again.
annual_ZHVI = annual_ZHVI.reset_index()
annual_ZRI = annual_ZRI.reset_index()
annual_ZHVI.head(n=50)
with plt.style.context('fivethirtyeight'):
    for city in annual_ZHVI['RegionName'].unique():
        selection = annual_ZHVI['RegionName'] == city
        df = annual_ZHVI.loc[selection]
        plt.plot(df['Year'], df['ZHVI']/1e3, label=city)

    ax = plt.gca()
    ax.set_ylim(50, 800)
    ax.legend(loc=0, ncol=2)
    ax.figure.set_size_inches((10, 6))
    ax.set_ylabel('Zillow Home Value Index\n(in thousands of dollars)')
with plt.style.context('fivethirtyeight'):
    for city in annual_ZRI['RegionName'].unique():
        selection = annual_ZRI['RegionName'] == city
        df = annual_ZRI.loc[selection]
        plt.plot(df['Year'], df['ZRI'], label=city)

    ax = plt.gca()
    ax.set_ylim(1000, 3300)
    ax.legend(loc=0, ncol=2)
    ax.set_ylabel('Zillow Rent Index (in dollars)')
    ax.figure.set_size_inches((10, 6))
Available on GitHub:
Two parts:
Due date: by the start of next class